Conversion of a Russian dependency treebank into HPSG derivations
Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 7-18.
© 2010 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/15891
Language models, surprisal and fantasy in Slavic intercomprehension
In monolingual human language processing, the predictability of a word given its surrounding sentential context is crucial. With regard to receptive multilingualism, it is unclear to what extent predictability in context interplays with other linguistic factors in understanding a related but unknown language, a process called intercomprehension. We distinguish two dimensions influencing processing effort during intercomprehension: surprisal in sentential context and linguistic distance. Based on this hypothesis, we formulate expectations regarding the difficulty of designed experimental stimuli and compare them to the results from think-aloud protocols of experiments in which Czech native speakers decode Polish sentences by agreeing on an appropriate translation. On the one hand, orthographic and lexical distances are reliable predictors of linguistic similarity. On the other hand, we obtain the predictability of words in a sentence with the help of trigram language models. We find that linguistic distance (encoding similarity) and in-context surprisal (predictability in context) are complementary, with neither factor outweighing the other, and that distinguishing these two measurable dimensions helps explain certain unexpected effects in human behaviour.
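The two dimensions the abstract distinguishes can be illustrated with a small sketch: length-normalised Levenshtein edit distance as a stand-in for orthographic distance, and add-one-smoothed trigram surprisal for predictability in context. The toy corpus, the smoothing choice, and all function names are illustrative assumptions, not the study's actual setup.

```python
import math
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Edit distance, a common proxy for orthographic distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def orthographic_distance(a: str, b: str) -> float:
    """Length-normalised Levenshtein distance in [0, 1]."""
    return levenshtein(a, b) / max(len(a), len(b))

def train_trigrams(corpus):
    """Count trigrams and their bigram contexts over tokenised sentences."""
    tri, bi, vocab = defaultdict(int), defaultdict(int), set()
    for sent in corpus:
        padded = ["<s>", "<s>"] + sent
        vocab.update(sent)
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bi[(padded[i - 2], padded[i - 1])] += 1
    return tri, bi, len(vocab)

def trigram_surprisal(sentence, tri, bi, vocab_size):
    """Per-word surprisal -log2 P(w | w-2, w-1) with add-one smoothing."""
    padded = ["<s>", "<s>"] + sentence
    out = []
    for i in range(2, len(padded)):
        ctx = (padded[i - 2], padded[i - 1])
        p = (tri[ctx + (padded[i],)] + 1) / (bi[ctx] + vocab_size)
        out.append(-math.log2(p))
    return out
```

A word that is both orthographically close to its Czech cognate and has low trigram surprisal would, on this account, be the easiest to decode.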
Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages
State-of-the-art spoken language identification (LID) systems, which are
based on end-to-end deep neural networks, have shown remarkable success not
only in discriminating between distant languages but also between
closely-related languages or even different spoken varieties of the same
language. However, it is still unclear to what extent neural LID models
generalize to speech samples with different acoustic conditions due to domain
shift. In this paper, we present a set of experiments to investigate the impact
of domain mismatch on the performance of neural LID systems for a subset of six
Slavic languages across two domains (read speech and radio broadcast) and
examine two low-level signal descriptors (spectral and cepstral features) for
this task. Our experiments show that (1) out-of-domain speech samples severely
hinder the performance of neural LID models, and (2) while both spectral and
cepstral features show comparable performance within-domain, spectral features
show more robustness under domain mismatch. Moreover, we apply unsupervised
domain adaptation to minimize the discrepancy between the two domains in our
study. We achieve relative accuracy improvements that range from 9% to 77%
depending on the diversity of acoustic conditions in the source domain.
Comment: To appear in INTERSPEECH 202
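The spectral/cepstral contrast the experiments draw on can be sketched roughly as follows: a log power spectrum per frame as the spectral descriptor, and its DCT as a simplified cepstral descriptor. Real MFCC pipelines insert a mel filterbank before the DCT; the frame length, hop size, and FFT size here are common defaults assumed for illustration, not the paper's pipeline.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def spectral_features(x, n_fft=512):
    """Log power spectrum per windowed frame: the 'spectral' descriptor."""
    frames = frame_signal(x) * np.hamming(400)
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(spec + 1e-10)

def dct2(x, n_ceps):
    """DCT-II along the last axis, keeping the first n_ceps coefficients."""
    n = x.shape[-1]
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T * np.sqrt(2.0 / n)

def cepstral_features(x, n_ceps=13):
    """DCT of the log spectrum: a simplified 'cepstral' descriptor
    (real MFCCs would apply a mel filterbank before the DCT)."""
    return dct2(spectral_features(x), n_ceps)
```

The intuition behind the paper's finding is visible in the construction: the DCT compacts the log spectrum into a few coefficients, discarding detail that may carry both channel and language cues.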
On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers
This contribution seeks to provide a rational probabilistic explanation for the intelligibility
of words in a genetically related language that is unknown to the reader, a phenomenon
referred to as intercomprehension. In this research domain, linguistic distance, among
other factors, was proved to correlate well with the mutual intelligibility of individual words.
However, the role of context for the intelligibility of target words in sentences was subject
to very few studies. To address this, we analyze data from web-based experiments in
which Czech (CS) respondents were asked to translate highly predictable target words at
the final position of Polish sentences. We compare correlations of target word intelligibility
with data from trigram language models (LMs) to their correlations with data obtained from
context-aware LMs. More specifically, we evaluate two context-aware LM architectures:
Long Short-Term Memory (LSTMs) that can, theoretically, take infinitely long-distance
dependencies into account and Transformer-based LMs which can access the whole
input sequence at the same time. We investigate how their use of context affects surprisal
and its correlation with intelligibility.
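The correlation analysis described above can be sketched as follows, with entirely invented surprisal and intelligibility values standing in for the experimental data; only the shape of the computation is meant to be informative.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical data: surprisal of each Polish target word under two LMs,
# and the fraction of Czech respondents who translated it correctly.
trigram_surprisal = [2.1, 7.8, 3.4, 9.2, 5.0]
transformer_surprisal = [1.5, 8.9, 2.8, 9.8, 4.1]
intelligibility = [0.95, 0.40, 0.85, 0.30, 0.60]

r_tri = pearson(trigram_surprisal, intelligibility)
r_trf = pearson(transformer_surprisal, intelligibility)
# Higher surprisal should go with lower intelligibility,
# so both correlations come out negative on this toy data.
```

Comparing |r_tri| against |r_trf| on real data is then the question the study asks: does richer context use yield surprisal estimates that track intelligibility more closely?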
Arguments, Grammatical Relations, and Diathetic Paradigm
Three syntactic representation levels are de facto distinguished: arguments, grammatical relations, and the diathetic paradigm.
This paper argues for the general notion of dependents in HPSG, in addition to arguments and subcategorized elements (valence). It attempts to provide a systematic inventory of ARG-ST / DEPS mappings which results in a diathetic paradigm. The approach offers an insightful cross-linguistic and cross-constructional perspective. It is important to realize that DEPS is not only an enriched level of argument structure; it is part of the diathesis of the predicator.
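As a rough illustration (not the paper's formalism), the ARG-ST / DEPS distinction and one cell of a diathetic paradigm might be mocked up like this; the roles, the case values, and the passivization rule are all invented for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Synsem:
    """A minimal stand-in for an HPSG synsem object (illustrative only)."""
    role: str   # e.g. "agent", "patient"
    case: str   # e.g. "nom", "acc", "instr"

@dataclass
class Predicator:
    arg_st: list                               # argument structure
    deps: list = field(default_factory=list)   # dependents, possibly richer than ARG-ST

def passivize(verb: Predicator) -> Predicator:
    """One cell of a diathetic paradigm: demote the first argument to an
    oblique dependent, promote the second (a toy passive diathesis)."""
    agent, patient, *rest = verb.arg_st
    demoted = Synsem(agent.role, "instr")  # e.g. an instrumental-marked agent
    # The demoted agent leaves ARG-ST but survives in DEPS, illustrating
    # the claim that DEPS is richer than argument structure.
    return Predicator(arg_st=[patient, *rest], deps=[patient, *rest, demoted])
```

The point of the sketch is the asymmetry it produces: after passivization DEPS holds one more element than ARG-ST, which is the sense in which the mapping between the two levels encodes diathesis.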
An ontology of systematic relations for a shared grammar of Slavic
Proceedings of COLING 2000, Vol. 1, pages 28-34.
Sharing portions of grammars across languages greatly reduces the costs of multilingual grammar engineering. Related languages share a much wider range of linguistic information than typically assumed in standard multilingual grammar architectures. Taking grammatical relatedness seriously, we are particularly interested in designing linguistically motivated grammatical resources for Slavic languages to be used in applied and theoretical computational linguistics. In order to gain the perspective of a language-family oriented grammar design, we consider an array of systematic relations that can hold between syntactic units. While the categorisation of primitive linguistic entities tends to be language-specific or even construction-specific, the relations holding between them allow various degrees of abstraction. On the basis of Slavic data, we show how a domain ontology conceptualising morphosyntactic "building blocks" can serve as a basis of a shared grammar of Slavic.
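A minimal sketch of the idea (relations shared across the family, instances kept language-specific) might look like this; every relation name and lexical entry below is invented for illustration and is not drawn from the paper.

```python
# Shared ontology: relation types that hold between morphosyntactic units,
# abstracted away from any single Slavic language.
SHARED_RELATIONS = {
    "agreement": ("controller", "target"),    # e.g. adjective-noun agreement
    "government": ("governor", "governee"),   # e.g. a verb governing a case
}

# Language-specific instantiations of the shared relation types
# (hypothetical entries for Russian, Polish, and Czech).
INSTANCES = {
    "ru": [("government", "trebovat'", "gen")],
    "pl": [("government", "żądać", "gen")],
    "cs": [("agreement", "adjective", "noun")],
}

def shared_grammar(langs):
    """Collect each language's instances under the shared ontology:
    the cross-lingual 'portion' of the grammar that the paper proposes
    to factor out."""
    return {lang: [(rel, *args) for rel, *args in INSTANCES[lang]
                   if rel in SHARED_RELATIONS]
            for lang in langs}
```

The design choice being illustrated: the categories ("żądać", "gen") stay language-specific, while the relation type ("government") and its role signature are defined once for the family.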
Gaining the Perspective of a Language Family Oriented Grammar Design: Special Predicative Clitics in Slavic
On the abstraction level of shared grammar, a common Slavic inventory of special predicative clitics can be postulated in terms of feature specifications referring to information on TYPE, CASE and INDEX (the latter encompassing person, number and gender). This inventory is subject to parameterisation across Slavic languages. We argue that such an approach can contribute considerably to formalising clitic typology.
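One way such a parameterised inventory could be encoded is sketched below; the TYPE/CASE/INDEX cells and the per-language settings are entirely hypothetical placeholders, not the paper's actual inventory.

```python
# Hypothetical shared Slavic inventory of special predicative clitics,
# keyed by (TYPE, CASE, INDEX) where INDEX = (person, number, gender).
SHARED_INVENTORY = {
    ("pron-clitic", "acc", ("3", "sg", "masc")): "acc-3sg-m",
    ("pron-clitic", "dat", ("3", "sg", "masc")): "dat-3sg-m",
    ("aux-clitic", None, ("1", "sg", None)): "aux-1sg",
}

# Parameterisation across languages: which cells of the shared inventory
# a given language realises (illustrative settings only).
REALISED = {
    "cs": {"acc-3sg-m", "dat-3sg-m", "aux-1sg"},
    "pl": {"dat-3sg-m"},
}

def clitic_inventory(lang):
    """Project the shared inventory onto one language's parameter setting."""
    return {spec: name for spec, name in SHARED_INVENTORY.items()
            if name in REALISED[lang]}
```

On this encoding, clitic typology reduces to comparing the per-language subsets of one shared table, which is the formalisation the abstract argues for.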